An efficient memory operations optimization technique for vector loops on Itanium 2 processors
Authors
Abstract
To keep up with a large degree of instruction level parallelism (ILP), the Itanium 2 cache systems use a complex organization scheme: load/store queues, banking and interleaving. In this paper, we study the impact of these cache systems on memory instruction scheduling. We demonstrate that, if no care is taken at compile time, the non-precise memory disambiguation mechanism and the banking structure cause severe performance loss, even for very simple regular codes. We also show that grouping the memory operations in a pseudo-vectorized way enables the compiler to generate more effective code for the Itanium 2 processor. The impact of this code optimization technique on register pressure is analyzed for various vectorization schemes.

Keywords: Performance Measurement, Cache Optimization, Memory Access Optimization, Bank Conflicts, Memory Address Disambiguation, Instruction Level Parallelism.
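The grouping idea in the abstract can be illustrated with a minimal sketch. The loop, the grouping factor `VEC`, and the function names below are hypothetical illustrations, not code from the paper: instead of alternating loads from two arrays each iteration (which can pair same-cycle accesses that collide in a cache bank), a strip-mined loop issues a group of loads from one array, then a group from the other, in vector style.

```c
#define VEC 4  /* assumed grouping factor; not taken from the paper */

/* Naive daxpy-like loop: a load from x and a load from y are issued
   close together every iteration, so the pair may fall into the same
   cache bank in the same cycle. */
void daxpy_naive(double *y, const double *x, double a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}

/* Pseudo-vectorized grouping: issue VEC consecutive loads from x,
   then VEC consecutive loads from y. Consecutive elements of one
   array map to distinct banks, so each group is conflict-free. */
void daxpy_grouped(double *y, const double *x, double a, int n) {
    int i;
    for (i = 0; i + VEC <= n; i += VEC) {
        double xv[VEC], yv[VEC];
        for (int k = 0; k < VEC; k++) xv[k] = x[i + k];  /* load group from x */
        for (int k = 0; k < VEC; k++) yv[k] = y[i + k];  /* load group from y */
        for (int k = 0; k < VEC; k++) y[i + k] = yv[k] + a * xv[k];  /* stores */
    }
    for (; i < n; i++)  /* scalar remainder */
        y[i] = y[i] + a * x[i];
}
```

The trade-off the abstract mentions is visible here: the grouped version keeps `2 * VEC` values live at once, so larger grouping factors raise register pressure, which is why the paper analyzes several vectorization schemes.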
Similar articles
Performance of OSCAR Multigrain Parallelizing Compiler on SMP Servers
This paper describes performance of OSCAR multigrain parallelizing compiler on various SMP servers, such as IBM pSeries 690, Sun Fire V880, Sun Ultra 80, NEC TX7/i6010 and SGI Altix 3700. The OSCAR compiler hierarchically exploits the coarse grain task parallelism among loops, subroutines and basic blocks and the near fine grain parallelism among statements inside a basic block in addition to t...
Development of High Performance Software Distributed Shared Memory System for Vector Processing
Parallel implementation of basic linear algebra operations for sparse matrix algorithms is a critical problem on shared memory architectures with finite memory bandwidth. We discuss the parallelizing methodology of vector processing and evaluate its performance on some commercially available shared memory systems. From the results of the evaluation, we hypothesize the most critical issue in buil...
Efficient Exploitation of Hyper Loop Parallelism in Vectorization
Modern processors can provide large amounts of processing power with vector SIMD units if the compiler or programmer can vectorize their code. With the advance of SIMD support in commodity processors, more and more advanced features are introduced, such as flexible SIMD lane-wise operations (e.g. blend instructions). However, existing vectorizing techniques fail to apply global SIMD lane-wise o...
Performance comparison of data-reordering algorithms for sparse matrix-vector multiplication in edge-based unstructured grid computations
Several performance improvements for finite-element edge-based sparse matrix–vector multiplication algorithms on unstructured grids are presented and tested. Edge data structures for tetrahedral meshes and triangular interface elements are treated, focusing on nodal and edges renumbering strategies for improving processor and memory hierarchy use. Benchmark computations on Intel Itanium 2 and P...
Weld for Itanium Processor
Sharma, Saurabh Weld for Itanium Processor (Under the direction of Dr. Thomas M. Conte) This dissertation extends a WELD for Itanium processors. Emre Özer presented WELD architecture in his Ph.D. thesis. WELD integrates multithreading support into an Itanium processor to hide run-time latency effects that cannot be determined by the compiler. Also, it proposes a hardware technique called operat...
Journal:
- Concurrency and Computation: Practice and Experience
Volume 18, Issue -
Pages: -
Published: 2006